Load the Libraries

About GenBank files and how to deal with them

We are dealing with .gb/.gbk files that contain a single record each.

From https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank/:

Depending on the type of GenBank file(s) you are interested in, they will either contain a single record, or multiple records. You can easily determine this by looking at the raw file - each record will start with a LOCUS line, followed by various other header lines, usually a list of features, the sequence data, and ends with a // line (slash slash).
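The single-versus-multiple check described above can be sketched in plain Python by counting LOCUS lines and // terminators in the raw text. The function name `count_genbank_records` and the abbreviated example records are made up for illustration; real files have many more header lines.

```python
def count_genbank_records(text):
    """Count records in GenBank flat-file text via its LOCUS and // lines."""
    lines = text.splitlines()
    # Each record starts with a line beginning with LOCUS ...
    locus_lines = sum(1 for line in lines if line.startswith("LOCUS"))
    # ... and ends with a line containing only // (slash slash).
    terminators = sum(1 for line in lines if line.strip() == "//")
    return locus_lines, terminators


# Tiny made-up example with two heavily abbreviated records.
example = """LOCUS       REC1   12 bp    DNA
ORIGIN
        1 atgcatgcat gc
//
LOCUS       REC2   6 bp    DNA
ORIGIN
        1 atgcat
//
"""
print(count_genbank_records(example))  # (2, 2)
```

A file with a single record should give (1, 1); any other result suggests a multi-record or truncated file.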

Locations provided by Biopython follow Python conventions (zero-based start, end-exclusive).

This way we can directly slice the sequence string with the provided locations to obtain the sequence of the features of interest.
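As a small illustration of why this matters, the sketch below converts a location as written in a GenBank file (1-based, inclusive) into a Python slice by hand. The helper `genbank_to_slice` and the short sequence are hypothetical; Biopython performs this conversion when it parses the file, so its locations can be used for slicing without this step.

```python
def genbank_to_slice(start, end):
    """Convert 1-based inclusive GenBank coordinates to a Python slice."""
    # GenBank 4..9 means bases 4 through 9; Python needs start 3, stop 9.
    return slice(start - 1, end)


seq = "ATGAAACCCGGGTTTTAA"  # made-up sequence for illustration
feature_slice = genbank_to_slice(4, 9)
print(feature_slice.start, feature_slice.stop)  # 3 9
print(seq[feature_slice])  # AAACCC
```

Plain slicing only gives the right answer for forward-strand features. For features on the reverse strand the slice must also be reverse-complemented, which Biopython's `feature.extract(record.seq)` takes care of.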

  1. The DDBJ/ENA/GenBank Feature Table Definition: documentation of the features used in GenBank files. A very good document that is worth reading through once.

    Source:

  2. Locus tag: Locus_tags are identifiers that are systematically applied to every gene in a genome. These tags have become surrogate gene names by the biological community. If two submitters of two different genomes use the same systematic names to describe two very different genes in two very different genomes, it can be very confusing. In order to prevent this from happening INSD has created a registry of locus_tag prefixes. Submitters of eukaryotic and prokaryotic genomes should register their prefix prior to submitting their genome. All components of a project (such as multiple chromosomes or plasmids, etc) should use the same locus_tag prefix.

    Source:

Extracting desirable information

Useful videos for the analysis done later:

  1. https://www.youtube.com/watch?v=LdQV3cbUwEE&list=PLe1-kjuYBZ05T9iHV_z60B9mpFt201ND5&index=8
  2. https://www.youtube.com/watch?v=HP7ThAj_f1E

Both videos are from the YouTube channel Bioinformatics Coach.

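KeyError resolution: indexing a qualifier directly raises a KeyError when a feature does not have that qualifier. Replace

gene_name = gene.qualifiers['gene'][0]

with a .get() call that supplies a default value:

gene_name = gene.qualifiers.get('gene', ['unavailable'])[0]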

Source: https://bioinformatics.stackexchange.com/questions/15454/keyerror-when-getting-features-from-a-genbank-file-with-biopython-with-some-acce/15456#15456
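The difference between the two forms can be seen in a minimal sketch. Plain dicts stand in for `feature.qualifiers` here, on the assumption that qualifiers behave like a dict mapping each key to a list of strings; the gene name and locus tags are made up.

```python
# Plain dicts stand in for Biopython's feature.qualifiers (illustration only).
with_gene = {"gene": ["dnaA"], "locus_tag": ["ABC_0001"]}
without_gene = {"locus_tag": ["ABC_0002"]}


def gene_name(qualifiers):
    """Return the first 'gene' value, or 'unavailable' if there is none."""
    return qualifiers.get("gene", ["unavailable"])[0]


print(gene_name(with_gene))     # dnaA
print(gene_name(without_gene))  # unavailable

# Direct indexing fails on the feature that lacks a 'gene' qualifier.
try:
    without_gene["gene"][0]
except KeyError as err:
    print("KeyError:", err)
```

The default is wrapped in a list so that the trailing `[0]` works the same way whether or not the qualifier is present.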

NOTE: Our file for Staphylococcus aureus (ATCC® 43300™) (https://genomes.atcc.org/genomes/79691302ed634fef) had only CDS features. A code block was therefore added to handle files that contain only CDS features instead of the usual combination of gene and CDS features.
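One possible shape for such a fallback is sketched below. This is not necessarily how the notebook's own code block works: the function name `features_to_process` is hypothetical, and `SimpleNamespace` objects stand in for Biopython features, using only their `type` and `qualifiers` attributes.

```python
from types import SimpleNamespace


def features_to_process(features):
    """Return gene features, or CDS features if there are no gene features."""
    genes = [f for f in features if f.type == "gene"]
    if genes:
        return genes
    # Fallback for files annotated with CDS features only.
    return [f for f in features if f.type == "CDS"]


# Stand-ins for the features of a CDS-only record (made-up values).
cds_only = [
    SimpleNamespace(type="source", qualifiers={}),
    SimpleNamespace(type="CDS", qualifiers={"product": ["hypothetical protein"]}),
]
print([f.type for f in features_to_process(cds_only)])  # ['CDS']
```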

How to iterate over a given directory: https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory
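For the .gb/.gbk files used here, directory iteration could look like the following sketch based on the standard-library `pathlib` module. The function name `find_genbank_files` is an assumption for illustration; the linked answer lists several alternatives.

```python
from pathlib import Path


def find_genbank_files(directory):
    """Return the .gb/.gbk files directly inside a directory, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in {".gb", ".gbk"}
    )


# Example: list the GenBank files in the current working directory.
for path in find_genbank_files("."):
    print(path.name)
```

Sorting keeps the processing order stable between runs, which helps when results from several files are compared later.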

Motif analysis

How to iterate rows of a dataframe: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

Visualization

Loading the data

Preparing for batch visualization

Frequency plots

Division by summed value normalization: MA_sum- SWAN plots
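As a rough sketch of the idea, assuming "division by summed value" means dividing each value by the total of all values so that the results sum to 1, the normalization could be written as below. The function name `normalize_by_sum` and the example counts are hypothetical and are not taken from the MA_sum data.

```python
def normalize_by_sum(values):
    """Divide each value by the sum of all values."""
    total = sum(values)
    if total == 0:
        # Avoid division by zero when every value is zero.
        return [0.0 for _ in values]
    return [v / total for v in values]


counts = [1, 3, 4]  # made-up counts for illustration
print(normalize_by_sum(counts))  # [0.125, 0.375, 0.5]
```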